Cell Systems

Elsevier BV

Preprints posted in the last 30 days, ranked by how well they match Cell Systems' content profile, based on 167 papers previously published here. The average preprint has a 0.54% match score for this journal, so anything above that is already an above-average fit.

1
Why phylogenies compress so well: combinatorial guarantees under the Infinite Sites Model

Hendrychova, V.; Brinda, K.

2026-03-27 bioinformatics 10.64898/2026.03.18.712055 medRxiv
Top 0.1%
40.0%

One important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain poorly understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the model's predictions with real bacterial datasets using an exact Traveling Salesperson Problem (TSP) solver. We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.
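A minimal sketch of the abstract's central idea: run-length encoding (RLE) of a binary SNP matrix compresses better when the genome (row) ordering keeps related genomes adjacent, because each column's 0/1 values then form long runs. The matrix and orderings below are illustrative toy data, not from the paper.

```python
# Rows = genomes, columns = SNP sites (toy data, Infinite-Sites-like:
# each mutation splits the genomes into two clades).

def rle_runs(column):
    """Number of RLE runs in a sequence (fewer runs -> better compression)."""
    runs = 1
    for prev, cur in zip(column, column[1:]):
        if cur != prev:
            runs += 1
    return runs

def total_runs(matrix, order):
    """Sum of RLE runs over all columns after reordering rows by `order`."""
    reordered = [matrix[i] for i in order]
    ncols = len(matrix[0])
    return sum(rle_runs([row[c] for row in reordered]) for c in range(ncols))

snp = [
    [1, 1, 0, 0],  # genome A (clade 1)
    [1, 1, 0, 1],  # genome B (clade 1)
    [0, 0, 1, 0],  # genome C (clade 2)
    [0, 0, 1, 1],  # genome D (clade 2)
]

tree_order = [0, 1, 2, 3]  # clades kept contiguous (an NJ-like ordering)
shuffled = [0, 2, 1, 3]    # clades interleaved

assert total_runs(snp, tree_order) < total_runs(snp, shuffled)
```

Finding the ordering that minimizes total runs over all columns is exactly the kind of ordering optimization the paper formalizes (and checks against an exact TSP solution).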

2
Emergent Biological Realism in RL-Trained DNA Language Models

Thiel, M.; Cunningham, A.; Barnes, C. P.

2026-03-26 bioinformatics 10.64898/2026.03.24.713963 medRxiv
Top 0.1%
37.5%

Reinforcement learning has driven the mass adoption of large language models by unlocking unexpected capabilities, yet this approach remains largely underexplored for generative DNA models. We investigate whether similar post-training techniques can induce emergent biological realism in DNA language models, using plasmid generation as a testbed due to plasmids' relative simplicity, well-characterized functional constraints, and ubiquity in biotechnology. Using Group Relative Policy Optimization with a reward function based on constraints from engineered biology, our model achieves a 77% quality control pass rate compared to 5% for the pretrained baseline. Remarkably, beyond explicitly optimized features, the model exhibits surprising biological parallels: generated sequences match natural plasmids in thermodynamic stability, codon usage patterns, and ORF length distributions, properties not explicitly optimized in the reward function. These results suggest that RL post-training can steer DNA language models toward biologically coherent regions of sequence space, analogous to how such techniques unlock unexpected capabilities in natural language models, particularly in verifiable domains.
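The distinguishing step of Group Relative Policy Optimization (GRPO), referenced above, is its critic-free advantage: each sampled sequence's reward is normalized against the other samples drawn for the same prompt. A minimal sketch with illustrative reward values (e.g., imagined QC pass fractions), not the authors' actual reward function:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: (reward - group mean) / group std,
    computed within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# One group of 4 sampled plasmids, scored by a hypothetical reward:
rewards = [0.9, 0.5, 0.5, 0.1]
adv = group_relative_advantages(rewards)
assert abs(sum(adv)) < 1e-9   # advantages are centered within the group
assert adv[0] > 0 > adv[3]    # best sample pushed up, worst pushed down
```

These advantages then weight the policy-gradient update, so no learned value function is needed.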

3
Geometry-aware ligand-receptor analysis distinguishes interface association from spatial localization and reveals a continuum of tumor communication

Yepes, S.

2026-04-08 bioinformatics 10.64898/2026.04.06.716708 medRxiv
Top 0.1%
32.6%

Spatial transcriptomics enables inference of cell-cell communication through ligand-receptor (LR) interactions, but current prioritization strategies often rely on expression strength or interface-associated enrichment without explicitly modeling tissue geometry. As a result, interactions associated with population interfaces are frequently interpreted as spatially localized even when their underlying expression is broadly distributed. Here, we present a geometry-aware framework for LR prioritization that explicitly separates interface structure from spatial localization within a locked and reproducible analysis pipeline. We quantify interface-associated communication using a distance-weighted boundary score defined on a spatial neighbor graph, evaluate interface specificity using a label-permutation null model that preserves spatial geometry, and compute an LR-specific localization score that captures the proximity of ligand and receptor expression to the corresponding interface. This framework distinguishes interface-associated compatibility from interaction-level spatial concentration. Across spatial transcriptomics datasets from breast cancer, colorectal cancer, melanoma, and pancreatic ductal adenocarcinoma, interface-aware ranking consistently recovers pathway families associated with extracellular matrix, adhesion, inflammatory, and immune-related processes. However, interface enrichment frequently shows limited separation from the null model, indicating that interface structure alone does not establish spatial specificity. Incorporating geometric localization substantially alters LR prioritization, distinguishing interactions that are concentrated near interfaces from those that are more diffusely distributed. Under a fixed, deterministic pipeline applied identically across datasets without parameter tuning, discrete spatial communication regimes were not reproducibly recovered. 
Instead, variation across samples is more consistently captured as continuous differences in geometry-aware attenuation, reflecting the degree to which inferred interactions are spatially constrained by tissue architecture. Together, these results demonstrate that interface-associated enrichment and spatial localization are distinct properties of inferred LR interactions, and that accurate interpretation of spatial communication requires explicit modeling of tissue geometry. Under this framework, tumor communication is more consistently described as a continuum of spatial constraint.
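A toy sketch of the two ingredients the abstract describes: a boundary score for ligand-receptor co-expression over interface-crossing edges of a spatial neighbor graph, and a label-permutation null that shuffles population labels while keeping the geometry fixed. The graph, labels, scoring rule, and expression values are all illustrative, not the paper's exact pipeline.

```python
import random

def boundary_score(edges, labels, expr):
    """Mean ligand-receptor co-expression over edges crossing populations."""
    cross = [(i, j) for i, j in edges if labels[i] != labels[j]]
    if not cross:
        return 0.0
    return sum(expr[i] * expr[j] for i, j in cross) / len(cross)

def permutation_pvalue(edges, labels, expr, n_perm=999, seed=0):
    """Shuffle labels (geometry fixed) to test interface specificity."""
    rng = random.Random(seed)
    observed = boundary_score(edges, labels, expr)
    hits = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        if boundary_score(edges, perm, expr) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# 6 spots on a line; two populations meet in the middle, and expression
# is concentrated at the interface (spots 2 and 3).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
expr = [0.1, 0.2, 1.0, 1.0, 0.2, 0.1]

p = permutation_pvalue(edges, labels, expr)
assert p < 0.5  # interface co-expression beats most label shuffles
```

The paper's point that "interface enrichment frequently shows limited separation from the null" corresponds to this p-value staying unremarkable even when the raw boundary score looks high.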

4
eBiota: Designing microbial communities from large seed pools with desired function using rapid optimization and deep learning

Jiang, X.; Hou, J.; Zhang, H.; Guo, J.; Gu, S.; Vandeputte, D.; Liao, Y.; Guo, Q.; Yang, X.; Zhou, Y.; Geng, P. X.; Wang, C.; Li, M.; Jousset, A.; Shen, X.; Wei, Z.; Zhu, H.

2026-03-31 bioengineering 10.64898/2026.03.29.714676 medRxiv
Top 0.1%
28.4%

Designing microbial communities to generate target products is crucial for biotechnology, agriculture, and disease treatment. However, rationally designing such communities from large seed pools has become a major challenge, as the rapidly expanding number of complete microbial genomes greatly expands the search space and sharply increases the required screening time and computational cost. Here, we introduce eBiota, a platform for ab initio design of microbial communities from a pool of 21,514 strains to generate target products. eBiota not only identifies optimal strain combinations but also simulates community behaviors, including microbial interactions and relative abundances. eBiota integrates three modules: CoreBFS, a graph-based search algorithm that rapidly screens for bacteria with complete metabolic pathways related to the target product; ProdFBA, an extended flux balance analysis that identifies microbial consortia with maximal production efficiency; and DeepCooc, a deep learning model trained on 23,323 microbiome samples across various environments to infer co-occurrence patterns. We validated eBiota's capabilities in microbial community design and production efficiency calculation using public microbiome datasets, ranging from single strains to six-member consortia. Further in vitro experiments involving 94 strains confirmed eBiota's ability to identify species that inhibit pathogen growth and to accurately model the relative abundances within complex microbial communities. As an initial digital twin, eBiota provides a powerful platform for the rational design of functional microbial communities, offering new opportunities for metabolic engineering and synthetic biology.
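A rough sketch of the kind of screen CoreBFS performs: breadth-first search over a metabolic reaction graph to decide whether a strain carries a complete route from a feedstock to the target product. The network, metabolite names, and firing rule below are illustrative simplifications, not eBiota's actual algorithm or database.

```python
from collections import deque

def has_complete_pathway(reactions, start, target):
    """Reachability BFS over metabolites. Simplification: a reaction
    fires once ANY of its substrates is reachable."""
    reachable = {start}
    queue = deque([start])
    while queue:
        met = queue.popleft()
        for substrates, products in reactions:
            if met in substrates:
                for p in products:
                    if p not in reachable:
                        reachable.add(p)
                        queue.append(p)
    return target in reachable

# Toy pathway: glucose -> pyruvate -> acetyl-CoA -> butanol
pathway = [
    ({"glucose"}, {"pyruvate"}),
    ({"pyruvate"}, {"acetyl-CoA"}),
    ({"acetyl-CoA"}, {"butanol"}),
]
assert has_complete_pathway(pathway, "glucose", "butanol")
# A strain missing the final reduction step fails the screen:
assert not has_complete_pathway(pathway[:-1], "glucose", "butanol")
```

Graph reachability is cheap, which is why such a screen can prune a 21,514-strain pool before the more expensive flux-balance step.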

5
Single-cell Transcriptomic Variance Analysis Reveals Intercellular Circadian Desynchrony in the Alzheimer's Affected Human Brain

Hollis, H. C.; Veltri, A.; Korac, K.; Menon, V.; Bennett, D. A.; Ronnekleiv-Kelly, S.; Kim, J.; Anafi, R. C.

2026-03-25 bioinformatics 10.64898/2026.03.23.713759 medRxiv
Top 0.1%
28.2%

Bulk tissue rhythms arise from the coordination of thousands of individual cellular oscillations. Bulk rhythm amplitude differences may reflect changes in the amplitude of the underlying cellular oscillators or changes in their temporal coherence. To resolve this fundamental ambiguity, we developed ORPHEUS (Oscillatory Rhythm Phase Heterogeneity Estimated Using Statistical-moments), an analytical method that quantifies cellular desynchrony by leveraging the unique 12-hr rhythmic signature it imparts on intercellular expression variance. After validating ORPHEUS in silico and on data from the mouse suprachiasmatic nucleus (SCN), we applied it to data from the mouse liver and human brain to uncover disease- and pathway-related differences in intercellular synchrony. In both tissues, we found that circadian synchrony is higher in cells and samples with higher MTORC activity. Most critically, we observed a dramatic loss of cellular synchrony in excitatory neurons from subjects with Alzheimer's Disease (AD) dementia. By decoupling the influence of cellular amplitude and synchrony, ORPHEUS introduces a new, interpretable tool for analyzing circadian coordination in time-course single-cell data.
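The "12-hr signature" has a simple mathematical origin: variance involves the square of the oscillation, and squaring a 24-hr cosine produces a component at twice the frequency. A toy simulation (illustrative phases, not ORPHEUS itself) showing that the across-cell variance of desynchronized 24-hr oscillators repeats every 12 hours:

```python
import math

def across_cell_variance(t, phases, period=24.0):
    """Variance across cells of cosine-shaped expression at time t."""
    vals = [math.cos(2 * math.pi * (t / period) + ph) for ph in phases]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

# Moderately desynchronized cells: phases spread over half a cycle.
n = 200
phases = [math.pi * i / n for i in range(n)]

v = [across_cell_variance(t, phases) for t in range(24)]
# The variance trace is rhythmic with a 12-hr period (half of 24 hr),
# because shifting t by 12 hr flips every cosine's sign, and variance
# is invariant to a sign flip.
assert all(abs(v[t] - v[(t + 12) % 24]) < 1e-9 for t in range(24))
assert max(v) - min(v) > 0.1  # and it genuinely oscillates
```

With perfectly synchronized cells the across-cell variance would be flat at zero, so the amplitude of this 12-hr variance rhythm is what carries the desynchrony signal.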

6
Dissecting the Black Box of AlphaFold in Protein-Protein Complex Assembly

Li, S.; Mu, Z.; Yan, C.

2026-04-06 bioinformatics 10.64898/2026.04.03.716280 medRxiv
Top 0.1%
27.7%

AlphaFold achieves unprecedented accuracy in modeling protein-protein complexes, yet the principles governing complex assembly remain unclear. Here, we develop a unified interpretability framework for AlphaFold-Multimer and AlphaFold3 to dissect the mechanisms underlying complex formation. We demonstrate that inter-protein coevolution is not a major determinant of assembly. Instead, complex structures are primarily driven by monomer geometry together with interface-level pattern matching between backbone complementarity and residue identities. By visualizing the iterative propagation of distance constraints during inference, we uncover a hierarchical process in which monomer-level constraints are established prior to cross-chain interactions, directly demonstrating that inter-chain geometry is inferred from monomer geometries rather than being encoded by coevolutionary signals. Application to antigen-antibody complexes further reveals that reduced prediction accuracy arises from the non-canonical and structurally plastic nature of immune interfaces, identifying accurate modeling of interface conformations and recognition of atypical antigen-antibody interaction patterns as key bottlenecks for improving immune complex prediction.

7
GROQ-seq Enables Cross-site Reproducibility for High-Throughput Measurement of Protein Function

Spinner, A.; Ross, D.; Cortade, D.; Ikonomova, S.; Baranowski, C.; Dhroso, A.; Reider Apel, A.; Sheldon, K.; Duquette, C.; Kelly, P. J.; DeBenedictis, E.; Hudson, C.

2026-04-09 bioengineering 10.64898/2026.04.07.716961 medRxiv
Top 0.1%
26.4%

High-throughput functional assays are increasingly used to generate large-scale protein function datasets for protein engineering and machine learning applications. However, the utility of such datasets depends on the reproducibility of the underlying measurements. Here we report reproducible, quantitative measurements of protein sequence-to-function data at scale across two facilities. We analyze GROQ-seq (Growth-based Quantitative Sequencing) measurements of three bacterial transcription factors. Independent barcode measurements of the same sequence produce highly consistent functional estimates, demonstrating strong biological reproducibility (across all transcription factors the mean Root Mean Square Deviation [RMSD] ≈ 0.53 and mean Spearman ≈ 0.63). We also compared experiments performed at two facilities using a shared protocol, but with differing levels of automation and system integration. We observe strong agreement between measurements taken at the two sites (mean RMSD ≈ 0.41 and mean Spearman ≈ 0.730). Orthogonal tests further support this agreement: a classifier trained to distinguish data by site performs near random (AUC = 0.559), and top-ranking variants show strong statistical overlap between experiments. Together, these results demonstrate that GROQ-seq enables reproducible, scalable measurement of protein function suitable for large aggregated datasets.
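For readers unfamiliar with the two agreement metrics quoted above, here is a from-scratch sketch of RMSD between paired functional estimates and Spearman rank correlation (Pearson correlation of ranks, assuming no ties). The paired measurements are illustrative, not GROQ-seq data.

```python
import math

def rmsd(x, y):
    """Root mean square deviation between paired measurements."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def ranks(v):
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks (no ties assumed)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical site-1 vs site-2 functional estimates for 5 variants:
site1 = [0.1, 0.4, 0.35, 0.8, 0.7]
site2 = [0.15, 0.38, 0.42, 0.75, 0.72]
assert rmsd(site1, site2) < 0.1
assert abs(spearman(site1, site2) - 0.9) < 1e-9
```

RMSD is sensitive to absolute calibration between sites, while Spearman only tracks whether the sites rank variants the same way, which is why the paper reports both.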

8
A Generative Neuro-Symbolic AI for Protein Sequence Design

Defresne, M.; Dessaux, D.; Buchet, S.; Barthe, L.; Ammar-Khodja, L.; Azizi, B.; Durante, V.; Cioci, G.; de Givry, S.; Roussel, A.; Garcia-Alles, L.; Schiex, T.; Barbe, S.

2026-04-02 bioengineering 10.64898/2026.03.31.715526 medRxiv
Top 0.1%
25.5%

Deep learning has revolutionized computational protein design, enabling the generation of sequences that fold onto target backbones with unprecedented accuracy. However, state-of-the-art inverse folding tools largely rely on auto-regressive sampling. While powerful, this paradigm is increasingly recognized for its inability to "think ahead", a crucial capacity to reliably create the complex, long-range inter-residue dependencies essential for most biological functions. To overcome these fundamental limitations, we introduce EffieDes, a generative neuro-symbolic AI framework that synergizes the predictive capabilities of deep learning with the logical precision of automated reasoning. EffieDes leverages deep learning to encode the target backbone's fitness landscape into Effie, a fully decomposable probabilistic graphical model (Potts model). This landscape is then rigorously explored by an automated reasoning prover to identify sequences that simultaneously satisfy complex design constraints and optimize backbone fitness. We validated this neuro-symbolic approach through the design of orthogonal sequence pairs that adopt identical folds but exhibit selective self-assembly, as well as the design of a de novo selective nanobody with nanomolar affinity for an immune-evasive SARS-CoV-2 variant. EffieDes provides a robust architecture for precisely dissecting learned fitness landscapes, offering a new path toward proteins with highly optimized performances and sophisticated functional objectives.
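The reason a Potts model is attractive for symbolic reasoning is its decomposability: a sequence's energy is a sum of per-position fields plus pairwise couplings, so exact solvers can optimize over it. A minimal sketch with a made-up 3-letter alphabet and invented parameters (not Effie's learned landscape):

```python
import itertools

AA = ["A", "G", "V"]  # toy amino-acid alphabet

def potts_energy(seq, fields, couplings):
    """E(s) = sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j); lower = fitter."""
    e = sum(fields[i][s] for i, s in enumerate(seq))
    for (i, j), J in couplings.items():
        e += J[(seq[i], seq[j])]
    return e

# Toy 2-position landscape: fields favor A at position 0 and G at
# position 1; the coupling penalizes identical residues at the pair.
fields = [{"A": 0.0, "G": 1.0, "V": 2.0},
          {"A": 1.0, "G": 0.0, "V": 0.5}]
couplings = {(0, 1): {(a, b): (0.0 if a == b else 0.3)
                      for a in AA for b in AA}}

# Because the landscape decomposes, exhaustive (or exact ILP/branch-and-
# bound) search can find the global optimum:
best = min(itertools.product(AA, repeat=2),
           key=lambda s: potts_energy(s, fields, couplings))
assert best == ("A", "G")
```

In EffieDes this exhaustive search is replaced by an automated reasoning prover, which can also impose hard design constraints that sampling-based methods only satisfy by luck.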

9
Predicting Unseen Gene Perturbation Response Using Graph Neural Networks with Biological Priors

Dip, S. A.; Zhang, L.

2026-03-26 bioinformatics 10.64898/2026.03.23.713780 medRxiv
Top 0.2%
23.3%

Predicting transcriptional responses to genetic perturbations is a central challenge in functional genomics. CRISPR Perturb-seq experiments measure gene expression changes induced by targeted perturbations, yet experimentally testing all possible perturbations remains infeasible. Computational models that infer responses for unseen perturbations are therefore essential for scalable functional discovery. We introduce PerturbGraph, a biologically informed graph-learning framework for predicting transcriptional responses of unseen gene perturbations by integrating interaction networks, functional annotations, and transcriptional features. Our approach is motivated by the observation that perturbation effects propagate through molecular interaction networks and manifest as coordinated transcriptional programs. Starting from single-cell CRISPR perturbation data, we construct perturbation signatures representing expression shifts relative to control cells and project them into a compact latent program space that captures stable transcriptional variation while reducing noise. Each gene is represented using enriched biological features integrating protein-protein interaction network embeddings, network topology statistics, baseline transcriptional characteristics, and Gene Ontology annotations. A graph neural network propagates information across the interaction network to infer perturbation programs for genes whose effects are not observed during training. Across unseen-perturbation benchmarks, PerturbGraph consistently outperforms classical machine learning models, perturbation-specific deep learning approaches such as scGen and CPA, and alternative graph neural architectures. The model achieves up to 6% improvement in cosine similarity over strong tree-based baselines and more than 20% improvement over linear models while improving recovery of differentially expressed genes. 
These results show that integrating biological interaction networks with graph representation learning enables accurate prediction of transcriptional effects for previously unobserved genetic perturbations. Code is publicly available at https://github.com/Sajib-006/PerturbGraph.
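The core mechanism PerturbGraph relies on is message passing: a gene's representation is updated from its neighbors in the interaction network, so an unseen gene can inherit signal from characterized neighbors. A one-step, pure-Python sketch on a tiny illustrative graph (mean aggregation; real GNNs add learned weights and nonlinearities):

```python
def message_passing_step(adj, feats):
    """One GNN-style step: h_i' = mean of neighbor features (self
    included), per feature dimension. `adj` is a 0/1 adjacency matrix."""
    new = []
    for i, row in enumerate(adj):
        nbrs = [j for j, a in enumerate(row) if a] + [i]
        dim = len(feats[0])
        new.append([sum(feats[j][d] for j in nbrs) / len(nbrs)
                    for d in range(dim)])
    return new

# 3-gene path graph; gene 2 is "unseen" (no measured perturbation
# signature, encoded here as a zero feature):
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0], [1.0], [0.0]]

out = message_passing_step(adj, feats)
assert out[2][0] > 0  # the unseen gene now carries neighbor information
```

Stacking several such steps lets perturbation effects propagate multiple hops through the network, which is the intuition the abstract states.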

10
Benchmarking and Experimental Validation of Machine Learning Strategies for Enzyme Engineering

Zeng, Z.; Jin, J.; Xu, R.; Luo, X.

2026-03-30 bioengineering 10.64898/2026.03.29.715152 medRxiv
Top 0.2%
22.8%

Enzyme-directed evolution increasingly relies on computational tools to prioritize mutations, yet their practical value is difficult to assess because kinetic data are often aggregated across heterogeneous assay conditions, inflating apparent generalization. Here we introduce EnzyArena, a curated benchmark that groups kinetic parameters (kcat, Km, kcat/Km) into condition-matched experimental subsets to enable realistic evaluation. Using this resource, we benchmark 10 representative models from two emerging strategy families--zero-shot fitness prediction and supervised kinetic-parameter prediction--across BRENDA- and SABIO-RK-derived subsets and 25 independent mutagenesis datasets. Kinetic-parameter predictors perform strongly on database-derived subsets but lose their advantage on independent datasets, whereas zero-shot predictors show more consistent generalization. A simple consensus of multiple zero-shot models further improves the precision of identifying beneficial mutants. We prospectively validated these findings in a wet-lab campaign (150 mutants) comparing random mutants, UniKP-prioritized mutants and ESM-1v-prioritized mutants (representing supervised kinetic-parameter prediction and zero-shot fitness prediction, respectively), where ESM-1v achieved the highest utility and UniKP underperformed the random baseline. Together, this study establishes realistic baselines for computational mutant prioritization and highlights consensus zero-shot strategies as a practical starting point for enzyme engineering.
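One common way to build the "simple consensus of multiple zero-shot models" mentioned above is rank averaging: rank mutants under each model, average the ranks, and re-rank. The exact consensus rule used in the paper is not specified here, so treat this as an illustrative variant with made-up scores:

```python
def consensus_ranking(score_lists):
    """score_lists: one {mutant: score} dict per model (higher = better).
    Returns mutants ordered by average rank across models (best first)."""
    mutants = list(score_lists[0])
    avg_rank = {}
    for m in mutants:
        total = 0
        for scores in score_lists:
            # rank 0 = best under this model
            total += sorted(scores, key=scores.get, reverse=True).index(m)
        avg_rank[m] = total / len(score_lists)
    return sorted(mutants, key=avg_rank.get)

# Two hypothetical zero-shot predictors scoring three mutants:
model_a = {"M1": 0.9, "M2": 0.5, "M3": 0.1}
model_b = {"M1": 0.85, "M2": 0.8, "M3": 0.2}
assert consensus_ranking([model_a, model_b])[0] == "M1"
```

Rank averaging sidesteps the fact that different zero-shot models output scores on incomparable scales, which is why it is a common default for model ensembling.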

11
When Multimodal Fusion Fails: Contrastive Alignment as a Necessary Stabilizer for TCR--Peptide Binding Prediction

Qi, C.; Wang, W.; Fang, H.; Wei, Z.

2026-04-02 bioinformatics 10.64898/2026.03.31.715453 medRxiv
Top 0.2%
22.7%

Multimodal learning is commonly assumed to improve predictive performance, yet in biological applications auxiliary modalities are often imperfect and can degrade learning if fused naively. We investigate this problem in TCR-peptide binding prediction, where sequence embeddings from pretrained protein language models are strong and transferable, but structure-derived residue graphs are built from predicted folds and heuristic discretization. In this setting, structural views can be noisy, inconsistent, and difficult to optimize jointly with sequence features. We introduce TRACE, a lightweight multimodal framework that encodes each entity (TCR and peptide) with parallel sequence and graph towers, then applies CLIP-style intra-entity contrastive alignment before interaction modeling. The alignment objective regularizes representation geometry by encouraging modality consistency for the same biological entity, thereby preventing unstable graph signals from dominating fusion. Across protocol-aware TCHard RN evaluations, naive sequence+graph fusion frequently underperforms a sequence-only baseline and can collapse toward near-random behavior. In contrast, TRACE consistently restores and improves performance. Controlled noise and supervision sweeps show that these gains persist under increasing graph corruption and positive-label scarcity, indicating that alignment is especially important when training conditions are hard. Our results challenge the assumption that adding modalities is inherently beneficial. Instead, they highlight a central principle for robust multimodal bioinformatics: performance depends not only on what modalities are used, but on how their interaction is constrained during optimization. TRACE provides a simple and general recipe for leveraging imperfect structural information without sacrificing stability.
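The "CLIP-style intra-entity contrastive alignment" described above can be sketched as a softmax over cosine similarities: each entity's sequence embedding must pick out its own graph embedding among the batch. A minimal pure-Python version with toy 2-D embeddings (TRACE's actual encoders and temperature are not reproduced here):

```python
import math

def clip_alignment_loss(seq_emb, graph_emb, temp=0.1):
    """InfoNCE-style loss: each sequence embedding should be most
    similar to the graph embedding of the SAME entity."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    n = len(seq_emb)
    loss = 0.0
    for i in range(n):
        logits = [cos(seq_emb[i], graph_emb[j]) / temp for j in range(n)]
        z = sum(math.exp(l) for l in logits)
        loss += -math.log(math.exp(logits[i]) / z)  # match own graph view
    return loss / n

# Two entities with aligned vs swapped modality views:
aligned = [[1.0, 0.0], [0.0, 1.0]]
assert clip_alignment_loss(aligned, aligned) < clip_alignment_loss(aligned, aligned[::-1])
```

Minimizing this loss pulls the two modality views of the same entity together in embedding space, the geometric regularization the abstract credits with preventing noisy graph signals from dominating fusion.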

12
Modeling gene regulatory perturbations via deep learning from high-throughput reporter assays

Venukuttan, R.; Doty, R.; Thomson, A.; Chen, Y.; Li, B.; Duan, Y.; Barrera, A.; Dura, K.; Ko, K.-Y.; Lapp, H.; Reddy, T. E.; Allen, A. S.; Majoros, W. H.

2026-03-31 bioinformatics 10.64898/2026.03.27.714770 medRxiv
Top 0.2%
22.5%

Assessing likely variant effects on phenotypes is of critical importance in diagnostic settings, and while much progress has been made in interpreting genic mutations based on our understanding of coding sequence, noncoding variants can be much more challenging to reliably interpret based on DNA sequence alone. High-throughput reporter assays such as STARR-seq and MPRA have shown utility in experimentally measuring regulatory effects of noncoding variants present in samples but provide no readout for variants not present in the assay inputs. However, whole-genome reporter assays provide copious data that can be used to train predictive models for prioritizing variants not directly observed in the experiment. We describe a retrainable predictive modeling framework, BlueSTARR, for this task, and present results of training several models with this framework on whole-genome STARR-seq data from two cell lines and one drug treatment. Using these models, we uncover a global signature across the human genome consistent with purifying selection against both loss-of-function and gain-of-function regulatory variants, with the latter showing a significant bias consistent with selection against gains of cis regulatory function in closed chromatin proximal to genes. By testing the model on synthetic enhancers with binding motifs for transcription factors GR and AP-1, we find that when trained on drug perturbation data, the model is able to learn distance-dependent and treatment-dependent binding patterns and their resulting reporter gene activation. These results demonstrate that lightweight, easily retrainable models such as ours have utility in probing latent signals present in novel experimental data. 
Finally, we find only modest differences in performance between different deep-learning architectures when trained on this single data modality, and while somewhat greater predictive accuracy can be achieved with much larger models trained at great expense on many terabytes of data, there is still copious room for improvement even for industrial-strength, state-of-the-art models.

13
Resolution of recursive data corruption to transform T-cell epitope discovery

Preibisch, G.; Tyrolski, M.; Kucharski, P.; Gizinski, S.; Grzegorczyk, P.; Moon, S.; Kim, S.; Zaro, B.; Gambin, A.

2026-04-01 bioinformatics 10.64898/2026.03.30.710191 medRxiv
Top 0.2%
22.5%

Accurate prediction of MHC class I-presented peptides is essential for any vaccine or T-cell therapy design, yet reported gains on in silico benchmarks have not translated into clinical successes. Here we show that this discrepancy may stem from a common methodological error: immunopeptidomics datasets are fundamentally contaminated by existing prediction models through prediction-based deconvolution and filtering, resulting in an iterative confirmation bias. An audit of the IEDB, the largest database in the field, reveals that as of January 2025, 55.8% of assessable data are labeled by computational models rather than verified experimentally. This inflates in silico benchmarks while degrading real-world applicability on new data, effectively making it impossible to objectively test model performance, which can lead to choosing suboptimal solutions and decreasing the chance of any therapy's clinical success. In silico simulation shows that iterative data corruption maintains high AUROC while top-of-list retrieval collapses. We reframe epitope discovery as a protein-centric learning-to-rank task and introduce deepMHCflare, a model evaluated exclusively on clean data. deepMHCflare achieves 0.80 Precision@4 on mono-allelic benchmarks versus 0.55-0.65 for gold-standard prediction models. A preclinical cancer vaccine study validated that 2 of the 4 deepMHCflare-nominated peptides were immunogenic, with a third independently confirmed in the literature.
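Precision@k, the retrieval metric quoted above (0.80 Precision@4), simply asks what fraction of the top-k ranked peptides are true hits; unlike AUROC, it collapses quickly when corrupted labels push decoys to the top of the list. A minimal sketch with illustrative peptide IDs:

```python
def precision_at_k(ranked_peptides, true_hits, k):
    """Fraction of the top-k ranked peptides that are truly presented."""
    top = ranked_peptides[:k]
    return sum(p in true_hits for p in top) / k

ranked = ["p1", "p7", "p3", "p9", "p2"]   # model ranking, best first
hits = {"p1", "p3", "p9", "p5"}           # experimentally verified peptides
assert precision_at_k(ranked, hits, 4) == 0.75
```

This is why the paper's simulation can show AUROC staying high while top-of-list retrieval collapses: AUROC averages over the whole ranking, Precision@k only looks at the candidates one would actually synthesize.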

14
Maximally Divergent Synonymous Gene Design with SIRIUS

Mohseni, A.; Wheeldon, I.; Lonardi, S.

2026-04-07 synthetic biology 10.64898/2026.04.06.716428 medRxiv
Top 0.2%
22.5%

The design of maximally divergent DNA sequences translating into the same protein is a critical problem in synthetic biology. Current design tools that rely on heuristics or machine learning often fail to effectively minimize the length of shared subsequences between the gene copies, compromising strain stability. Here, we introduce SIRIUS, a combinatorial optimization algorithm designed to generate maximally divergent coding sequences for a given protein of interest. Leveraging integer linear programming that enforces host-specific codon usage thresholds, SIRIUS stabilizes synthetic constructs and broadens the accessible design space for robust and scalable synthetic biology. Experimental results show that SIRIUS produces diverse sequences with fewer shared subsequences than existing methods. SIRIUS is freely available on GitHub at https://github.com/ucrbioinfo/sirius.
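The objective SIRIUS targets can be made concrete with a longest-common-substring check between two synonymous codings of the same peptide: divergent designs keep shared stretches short (long shared stretches promote recombination). The DP below and the two toy codings of Leu-Arg-Ser are illustrative, not SIRIUS's ILP formulation:

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b
    (classic O(len(a)*len(b)) dynamic program)."""
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

# Two synonymous codings of the peptide Leu-Arg-Ser, using different
# codons at every position (CTG/TTA = Leu, CGT/CGG = Arg, AGC/TCA = Ser):
cds1 = "CTGCGTAGC"
cds2 = "TTACGGTCA"
assert longest_common_substring(cds1, cds2) <= 3  # no shared codon-length run
```

An ILP can minimize this quantity globally over all synonymous codon choices while also constraining codon usage, which is the combination the abstract describes.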

15
SCOPE: Localizing fate-decision states and their regulatory drivers in single-cell differentiation

Zhao, Y.; Finkbeiner, C.; Setty, M.; Lin, K.

2026-04-09 cell biology 10.64898/2026.04.07.717037 medRxiv
Top 0.2%
22.4%

Identifying the precise transcriptomic states at which cells commit to a lineage (branchpoints) and the temporal lag in which chromatin accessibility foreshadows gene expression (epigenetic priming) remain fundamental challenges in developmental biology. While current methods for single-cell sequencing data effectively capture developmental flow, they often lack a principled mechanism for delineating the discrete boundaries, a crucial aspect required to map the molecular logic of lineage commitment. We present SCOPE (Semi-supervised Conformal Prediction), a framework that transforms high-dimensional single-cell measurements into rigorous, discrete prediction sets of all plausible future fates. By formalizing fate uncertainty via conformal inference, SCOPE localizes the precise biological windows during which multipotent progenitors specify their fate. In multi-omic data, SCOPE uncovers epigenetic priming and identifies its driving transcription factors by detecting regimes where chromatin-derived prediction sets resolve toward terminal fates significantly before their transcriptomic counterparts. We apply SCOPE across simulations, lineage-traced mouse hematopoiesis, multiple human hematopoietic datasets, and human retinogenesis to demonstrate its broad applicability and ability to recapitulate known fate specification drivers. Ultimately, SCOPE provides a statistically grounded foundation for localizing fate decisions across biological replicates and modalities, offering a robust tool for identifying the onset of lineage specification in complex developmental systems.
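The conformal machinery behind SCOPE's "prediction sets of all plausible future fates" can be sketched with the standard split-conformal recipe: calibrate a nonconformity threshold on held-out cells with known fates, then include every fate whose score clears it. The calibration scores and fate probabilities below are toy values, and this is the generic construction, not SCOPE's semi-supervised variant:

```python
import math

def conformal_threshold(cal_scores, alpha=0.2):
    """Split-conformal quantile of calibration nonconformity scores
    (here, nonconformity = 1 - predicted probability of the true fate)."""
    s = sorted(cal_scores)
    k = min(len(s) - 1, int(math.ceil((len(s) + 1) * (1 - alpha))) - 1)
    return s[k]

def prediction_set(fate_probs, threshold):
    """All fates whose nonconformity (1 - prob) is within the threshold."""
    return {f for f, p in fate_probs.items() if 1 - p <= threshold}

# Calibration: nonconformity scores from 19 held-out cells.
cal = [i * 0.05 for i in range(1, 20)]
th = conformal_threshold(cal, alpha=0.2)

# A multipotent progenitor keeps several plausible fates in its set;
# a committed cell's set collapses to a singleton:
multipotent = prediction_set({"neuron": 0.6, "glia": 0.3, "other": 0.05}, th)
committed = prediction_set({"neuron": 0.95, "glia": 0.03, "other": 0.02}, th)
assert committed == {"neuron"} and len(multipotent) > 1
```

Tracking where along a trajectory the set first shrinks to one fate is exactly how such sets can localize a commitment window, with the conformal guarantee controlling the miscoverage rate at alpha.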

16
PACMON: Pathway-guided Multi-Omics data integration for interpreting large-scale perturbation screens

Qoku, A.; Stickel, T.; Amerifar, S.; Wolf, S.; Oellerich, T.; Buettner, F.

2026-03-24 bioinformatics 10.64898/2026.03.20.713295 medRxiv
Top 0.2%
22.3%

High-throughput perturbation screens coupled with single-cell molecular profiling enable systematic interrogation of gene function, yet interpreting the resulting data in terms of biological pathways remains challenging. Existing approaches either identify latent gene modules without linking them to perturbations, or model perturbation effects without incorporating prior biological knowledge, limiting interpretability and scalability. Here, we introduce PACMON (Pathway-guided Multi-Omics data integration for interpreting large-scale perturbation screens), a Bayesian latent factor model that jointly infers pathway-level programs and their modulation by experimental perturbations. PACMON decomposes multimodal molecular measurements into shared latent factors aligned with known biological pathways through structured sparsity priors, while simultaneously estimating how each perturbation activates or represses these pathway programs. The framework naturally accommodates multiple data modalities and employs stochastic variational inference for scalable application to large datasets. We evaluate PACMON in three settings of increasing complexity. On synthetic data with known ground truth, PACMON achieves near-perfect recovery of pathway structure and perturbation effects, outperforming existing methods in both accuracy and computational scalability. Applied to a multimodal Perturb-CITE-seq screen of melanoma cells, PACMON recovers coherent interferon-signaling and cell-cycle programs spanning RNA and surface-protein modalities and identifies interpretable perturbation-pathway associations consistent with known immune-evasion mechanisms. Finally, we apply PACMON to the Tahoe-100M perturbation atlas -- approximately 100 million cells and over 1,000 drug-dose combinations -- producing the first pathway-level latent factor analysis at this scale and revealing biologically meaningful drug-response landscapes across Hallmark pathway programs.
PACMON provides a unified, scalable and interpretable framework for mapping perturbation effects onto biological pathways in modern large-scale perturbation experiments.

17
Expanding the scope of redox-balance growth coupling techniques with a carbon cofeeding strategy

Cowan, A. E.; Cawthon, B.; Hillers, M.; Perea, S.; Grabovac, M.; Stanton, A.; Saleh, S.; Gin, J.; Chen, Y.; Petzold, C. J.; Keasling, J. D.

2026-04-05 bioengineering 10.64898/2026.04.01.713023 medRxiv
Top 0.2%
22.2%

Metabolic engineering to produce molecules not naturally synthesized by the host often requires directed evolution to improve pathway enzyme performance. Growth-coupled selection can dramatically increase directed-evolution throughput, and manipulation of redox balance has proven effective for tying reductase fitness to microbial growth. However, most redox-balance selections require feeding the reductase substrate because of stoichiometric constraints. This is impractical for many biosynthetic pathways either due to practical limitations on cost or complexity of bulk substrate synthesis, or the lack of an ability to transport substrate into cells, for example intracellular acyl-CoA/ACP intermediates. Here we define stoichiometric constraints that make substrate feeding necessary for many acetyl-CoA-derived reduction pathways in NADPH-imbalanced hosts. We overcome these constraints with a dual-feedstock strategy in which glucose provides reducing power while acetate supplies additional acetyl-CoA without directly perturbing redox balance. In an engineered E. coli selection strain, acetate co-feeding enabled growth coupling of acetaldehyde, 3-hydroxybutyrate, and mevalonate production and produced a linear correlation between product formation and growth. We then used this selection to evolve a class II HMG-CoA reductase (HMGR) from Delftia acidovorans toward NADPH utilization, enriching variants with improved NADPH-dependent activity. Finally, propionate co-feeding enabled growth coupling of propionyl-CoA reduction, supporting the generality of carbon co-feeding for selecting enzymes in pathways involving acyl-chain elongation and reduction. 
Highlights
- Stoichiometric limits of redox-balance growth coupling are defined
- Acetate co-feeding supplies acetyl-CoA without perturbing redox balance
- Co-feeding enables growth coupling of acetaldehyde, 3-HB, and mevalonate
- Growth coupling enables evolution of HMGR toward NADPH specificity
- Propionate co-feeding extends growth coupling to additional acyl-CoA substrates
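
The dual-feedstock logic can be sketched as a toy mass balance. All stoichiometric coefficients below are illustrative placeholders, not values from the paper: glucose is assumed to supply both NADPH and acetyl-CoA, while acetate is assumed to supply acetyl-CoA only, leaving the redox balance untouched.

```python
def feasible_flux(glucose, acetate, nadph_per_product, accoa_per_product):
    """Max product flux supported by both the NADPH and acetyl-CoA pools.

    Coefficients are hypothetical, chosen only to illustrate the idea
    that acetate relieves the carbon constraint without adding NADPH.
    """
    NADPH_PER_GLUCOSE = 2.0   # placeholder: reducing power from glucose
    ACCOA_PER_GLUCOSE = 2.0   # placeholder: acetyl-CoA from glucose
    ACCOA_PER_ACETATE = 1.0   # acetate -> acetyl-CoA, redox-neutral here
    nadph_supply = NADPH_PER_GLUCOSE * glucose
    accoa_supply = ACCOA_PER_GLUCOSE * glucose + ACCOA_PER_ACETATE * acetate
    # product flux is limited by whichever pool runs out first
    return min(nadph_supply / nadph_per_product,
               accoa_supply / accoa_per_product)

# Glucose alone: the pathway is acetyl-CoA-limited.
glucose_only = feasible_flux(glucose=1.0, acetate=0.0,
                             nadph_per_product=1.0, accoa_per_product=3.0)
# Acetate co-feed: carbon constraint relieved, NADPH still limiting,
# so selection pressure remains on the NADPH-consuming reductase.
cofeed = feasible_flux(glucose=1.0, acetate=4.0,
                       nadph_per_product=1.0, accoa_per_product=3.0)
print(glucose_only, cofeed)
```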

18
ST-PARM: Pareto-Complete Inference-Time Alignment for Multi-Objective Protein Design

Yin, R.; Shen, Y.

2026-03-19 bioinformatics 10.64898/2026.03.17.712483 medRxiv
Top 0.2%
22.2%
Show abstract

Motivation: Protein engineering is inherently multi-objective: improving one property can degrade others, so practical workflows require generating non-dominated (Pareto-optimal) candidates spanning a trade-off surface. Linear objective scalarization and deterministic pairwise preference learning can under-explore non-convex Pareto regions and amplify noise from uncertain evaluators, limiting Pareto coverage and trade-off controllability.
Results: We introduce the Smooth Tchebycheff Preference-Aware Reward Model (ST-PARM), an inference-time alignment framework that steers a frozen protein language model along user-specified trade-offs with a lightweight reward model trained only once. ST-PARM combines (i) a reward-calibrated pairwise preference loss that is uncertainty-aware by down-weighting ambiguous comparisons under noisy evaluators, (ii) a smooth Tchebycheff scalarization that is Pareto-complete in principle and improves empirical trade-off coverage, and (iii) latent-space pair-construction strategies. On GFP fluorescence-stability (full-length design) and IL-6 nanobody stability-solubility (CDR3+suffix design), ST-PARM delivers broader Pareto coverage and stronger preference tracking than the baselines PARM and MosPro. For GFP, a conservative structural screen for local confidence and global fold preservation retains a broad frontier and strong controllability, yielding an actionable cohort for downstream assays. We also provide cross-evaluator robustness checks, a three-objective extension, and a natural-language alignment generality check in the Supplement, establishing a practical foundation for controllable sequence generation under competing objectives and noisy measurements.
Availability and Implementation: https://github.com/Shen-Lab/ST-PARM.
Supplementary Information: Supplementary data are provided with the submission.
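
ST-PARM's exact formulation lives in the paper and repository; the sketch below only illustrates the standard smooth Tchebycheff scalarization it builds on, a log-sum-exp smoothing of the classic max-based Tchebycheff objective that, unlike a linear weighted sum, can reach Pareto points in non-convex regions. Objective values, ideal point, and weights here are made-up toy numbers.

```python
import math

def tchebycheff(f, z_star, w):
    """Classic (non-smooth) Tchebycheff scalarization: max_i w_i * (f_i - z*_i)."""
    return max(wi * (fi - zi) for fi, zi, wi in zip(f, z_star, w))

def smooth_tchebycheff(f, z_star, w, mu=0.1):
    """Log-sum-exp smoothing of the Tchebycheff scalarization.

    Approaches the non-smooth version from above as mu -> 0, while
    staying differentiable, which makes it usable as a training or
    guidance objective.
    """
    terms = [wi * (fi - zi) / mu for fi, zi, wi in zip(f, z_star, w)]
    m = max(terms)  # stabilize the log-sum-exp
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))

f = [0.8, 0.3]   # toy objective values (minimization)
z = [0.0, 0.0]   # ideal point
w = [0.5, 0.5]   # user-specified trade-off weights
print(tchebycheff(f, z, w))               # 0.4
print(smooth_tchebycheff(f, z, w, 0.01))  # tight smooth upper bound on 0.4
```

Sweeping the weight vector w traces out the trade-off surface, which is how a single scalarized objective can serve user-specified preferences at inference time.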

19
X-Cell: Scaling Causal Perturbation Prediction Across Diverse Cellular Contexts via Diffusion Language Models

Wang, C.; Karimzadeh, M.; Ravindra, N. G.; Bounds, L. R.; Alerasool, N.; Huang, A. C.; Ma, S.; Gulbranson, D. R.; Cui, H.; Lee, Y.; Arjavalingam, A.; MacKrell, E. J.; Wilken, M. S.; Chen, J.; Herken, B. W.; Weber, J. A.; Onesto, M. M.; Gonzalez-Teran, B.; Leung, N. F.; Shi, S. Y.; Smith, B. J.; Lam, S. K.; Barner, A.; Wright, P.; Rumsey, E. M.; Kim, S.; Sit, R. V.; Litterman, A. J.; Chu, C.; Wang, B.

2026-03-20 systems biology 10.64898/2026.03.18.712807 medRxiv
Top 0.2%
22.1%
Show abstract

Causal models of cellular systems hold the promise to empower broad biological discovery, including the systematic identification of novel targets for drug discovery. Predicting how genetic and pathway perturbations reshape gene expression across diverse cellular contexts is a prerequisite for building generalizable cellular foundation models. However, current methods typically fail to extrapolate beyond their training distributions because they rely predominantly on observational expression atlases rather than interventional perturbation data. We present X-Atlas/Pisces, the largest genome-wide CRISPRi Perturb-seq compendium to date, comprising 25.6 million perturbed single-cell transcriptomes across 16 biologically diverse contexts, including widely used cell lines, induced pluripotent stem cells (iPSCs), resting and CD3/CD28-activated Jurkat T lymphoma cells, and multi-lineage differentiating iPSCs. Leveraging this resource, we develop X-Cell, a diffusion language model that predicts perturbation responses by iteratively refining control-to-perturbed state transitions through cross-attention to multi-modal biological priors derived from natural language, protein language models, interaction networks, genetic dependency maps, and morphological profiles. X-Cell outperforms existing state-of-the-art models by up to five-fold on key metrics such as PearsonΔ (correlation between predicted and observed perturbation-induced log-fold changes), and demonstrates zero-shot prediction of T cell inactivating perturbations in stimulated Jurkat cells. We scale X-Cell to 4.9 billion parameters (X-Cell-Ultra), the largest causal perturbation model to date. We demonstrate for the first time that perturbation prediction follows power-law scaling with an exponent matching large language models.
X-Cell-Ultra demonstrates zero-shot generalization to novel biological contexts, including unseen iPSC-derived melanocyte progenitors and primary human CD4+ T cells from multiple donors, and outperforms all baselines after self-supervised test-time adaptation. These results demonstrate that coordinated scaling of causal perturbation data and model capacity yields foundation models capable of generalizable perturbation prediction across cellular contexts, with potential applications for improving computational target identification, validation, and context-specific therapeutic prioritization.
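
The PearsonΔ metric described above can be sketched in a few lines: correlate predicted against observed perturbation-induced log-fold changes across genes. The toy data and variable names below are illustrative, not from the paper.

```python
import math
import random

def pearson_delta(pred_lfc, obs_lfc):
    """Pearson correlation between predicted and observed log-fold
    changes (perturbed vs. control), computed over genes."""
    n = len(pred_lfc)
    mp = sum(pred_lfc) / n
    mo = sum(obs_lfc) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred_lfc, obs_lfc))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred_lfc))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs_lfc))
    return cov / (sp * so)

random.seed(0)
# toy "observed" per-gene log-fold changes for one perturbation,
# and a noisy "prediction" of them
obs = [random.gauss(0, 1) for _ in range(200)]
pred = [o + random.gauss(0, 0.5) for o in obs]
print(f"PearsonΔ on toy data: {pearson_delta(pred, obs):.2f}")
```

In practice the metric is averaged over held-out perturbations, so a model is rewarded for capturing each perturbation's expression signature rather than the baseline transcriptome.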

20
Improved inference of multiscale sequence statistics in generative protein models

Chauveau, M.; Kleeorin, Y.; Hinds, E.; Junier, I.; Ranganathan, R.; Rivoire, O.

2026-04-09 systems biology 10.64898/2026.04.06.716859 medRxiv
Top 0.3%
22.0%
Show abstract

High dimensionality and multiscale statistical structure are pervasive features of biological data, posing fundamental challenges for modeling. Because model inference generally proceeds with far fewer data than parameters, statistical patterns across scales are often unevenly represented. Protein sequences provide a paradigmatic example: statistics across homologs are inherently multiscale, displaying collective correlations among conserved residue sectors that encode function, alongside localized correlations corresponding to physical contacts outside these sectors. Standard regularization strategies used to mitigate undersampling during model inference have been shown to capture these patterns unevenly, a bias that compromises generative models of protein sequences by limiting their ability to produce both functional and diverse proteins. This limitation is exemplified by Boltzmann machine-based generative models, which so far have required post hoc corrections to recover functionality, at the cost of reduced sequence diversity. Here, we introduce the stochastic Boltzmann Machine (sBM), a new regularization strategy that more accurately captures different correlation scales. Through analyses of theoretical models with known ground-truth parameters and experiments on the chorismate mutase family, we show that sBM effectively mitigates distortions in the estimation of model parameters, enabling the generation of functional sequences with greater diversity and without the need for post hoc corrections. These results advance the inference of generative models that more faithfully reflect the evolutionary constraints shaping protein sequences.
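
The multiscale statistics at issue are the one- and two-point frequencies of an alignment, which Boltzmann-machine generative models are trained to reproduce. A minimal sketch (toy alignment, illustrative names) of the connected correlations C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b) that carry both the localized contact-like and collective sector-like signal:

```python
from collections import Counter
from itertools import combinations

def connected_correlations(seqs):
    """One- and two-point statistics of an aligned sequence set.

    Returns C[(i, j, a, b)] = f_ij(a, b) - f_i(a) * f_j(b) for every
    observed residue pair, the target statistics of a Boltzmann-machine
    generative model (undersampling of which is what regularization
    must handle).
    """
    n = len(seqs)
    L = len(seqs[0])
    f1 = [Counter(s[i] for s in seqs) for i in range(L)]  # per-column counts
    corr = {}
    for i, j in combinations(range(L), 2):
        f2 = Counter((s[i], s[j]) for s in seqs)          # per-pair counts
        for (a, b), c in f2.items():
            corr[(i, j, a, b)] = c / n - (f1[i][a] / n) * (f1[j][b] / n)
    return corr

# toy 4-sequence, length-3 alignment: columns 0 and 1 co-vary, column 2 is conserved
seqs = ["ACD", "ACD", "AED", "GED"]
C = connected_correlations(seqs)
print(C[(0, 1, 'A', 'C')])  # 0.125 = 2/4 - (3/4)*(2/4): correlated pair
print(C[(0, 2, 'A', 'D')])  # 0.0: conserved column carries no pairwise signal
```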